PSCI 8357 - STAT II
Department of Political Science, Vanderbilt University
February 1, 2026
This week we will see that we can also use regression agnostically to estimate causal estimands.
BUT this only solves the estimation problem.
Problem: we want to learn about the relationship between \(X\) and \(Y\).
CEF
The CEF, \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\), is the expected value of \(Y_i\) across values of \(X_i\):
For continuous \(Y_i\) \[ {\mathbb{E}}[Y_i {\:\vert\:}X_i] = \int_{\mathcal{Y}} y f(y {\:\vert\:}X_i) \, dy \]
For discrete \(Y_i\): \[ {\mathbb{E}}[Y_i {\:\vert\:}X_i] = \sum_{y \in \mathcal{Y}} y \, p(y {\:\vert\:}X_i) \]
CEF Decomposition Property
\[ Y_i = \underbrace{{\mathbb{E}}[Y_i {\:\vert\:}X_i]}_{\text{explained by $X_i$}} + \underbrace{\varepsilon_i}_{\text{unexplained}}, \]
where \({\mathbb{E}}[\varepsilon_i {\:\vert\:}X_i] = 0\) and \(\varepsilon_i\) is uncorrelated with any function of \(X_i\)
To see this property, recall
\[ \begin{align*} \varepsilon_i &= Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i] \quad \implies\\ {\mathbb{E}}[\varepsilon_i {\:\vert\:}X_i] &= {\mathbb{E}}[Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i] {\:\vert\:}X_i] = 0 \end{align*} \]
Also, \({\mathbb{E}}[h(X_i) \varepsilon_i] = 0\) for any function \(h(\cdot)\). (How can we use the Law of Iterated Expectations to prove this?)
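One way to see it, using iterated expectations:
\[ {\mathbb{E}}[h(X_i) \varepsilon_i] = {\mathbb{E}}\big[ {\mathbb{E}}[h(X_i) \varepsilon_i {\:\vert\:}X_i] \big] = {\mathbb{E}}\big[ h(X_i) \, {\mathbb{E}}[\varepsilon_i {\:\vert\:}X_i] \big] = {\mathbb{E}}[h(X_i) \cdot 0] = 0. \]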
CEF Prediction Property
\[ {\mathbb{E}}[Y_i {\:\vert\:}X_i] = {\arg\!\min}_{g(X_i)} {\mathbb{E}}\left[ (Y_i - g(X_i))^2 \right], \] where \(g(X_i)\) is any function of \(X_i\).
\[ \begin{align*} (Y_i - g(X_i))^2 &= \left(Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i] + {\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\right)^2 \\ &= \left(Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]\right)^2 + 2\left(Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]\right)\left({\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\right) \\ &\quad + \left({\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\right)^2. \end{align*} \]
Taking expectations, the cross term vanishes because \({\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\) is a function of \(X_i\) alone and \(Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i] = \varepsilon_i\) is uncorrelated with any such function; the first term does not involve \(g\). Hence \({\mathbb{E}}[(Y_i - g(X_i))^2]\) is minimized by setting the last term to zero, i.e. by \(g(X_i) = {\mathbb{E}}[Y_i {\:\vert\:}X_i]\).
Density distributions show the spread of \(Y\) values at each discrete \(X\); black line connects the conditional means.
The CEF properties we just established are important because:
Decomposition: Any outcome can be split into a systematic part (explained by covariates) and noise.
Optimality: The CEF is the best predictor of \(Y_i\) given \(X_i\) in the MSE sense.
The \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\) quantity looks very familiar: we already used it in \({\mathbb{E}}[Y_i {\:\vert\:}T_i]\) and \({\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i]\).
We want to see whether regression helps us estimate these quantities, especially differences in means.
Note: There is nothing causal about \({\mathbb{E}}[Y_i {\:\vert\:}T_i]\) or \({\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i]\), so we still need identification.
Before we move on, we need to recall some important facts about regression coefficients.
The population regression coefficient vector is given by (this follows directly from \({\mathbb{E}}[X_i \varepsilon_i] = 0\)) \[ \beta = {\mathbb{E}}[X_i X_i^{\prime}]^{-1} {\mathbb{E}}[X_i Y_i] \]
The regression coefficient in the single-covariate case is given by (population and sample analog) \[ \beta = \frac{{\mathrm{cov}}(Y_i,X_i)}{{\mathbb{V}}(X_i)}, \quad \widehat{\beta} = \frac{\sum_{i = 1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i = 1}^{n} (X_i - \bar{X})^2} \]
The regression coefficient in the multiple-covariate case is given by \[ \beta_{k} = \frac{{\mathrm{cov}}(\tilde{Y}_i,\tilde{X}_{ki})}{{\mathbb{V}}(\tilde{X}_{ki})}, \] where \(\tilde{X}_{ki}\) is the residual from regressing \(X_{ki}\) on the other covariates \(X_{-k,i}\) (and \(\tilde{Y}_i\) is defined analogously)
Theorem: Linear CEF
If CEF \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\) is linear in \(X_i\), then the population regression function \(X_i^{\prime} \beta\) returns exactly \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\).
To see this property, we can:
Use the decomposition property of the CEF to see \({\mathbb{E}}[ X_i (Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]) ] = 0\)
Substitute \({\mathbb{E}}[Y_i {\:\vert\:}X_i] = X_i^{\prime} b\) and solve for \(b\) (see the sketch below)
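Carrying out these two steps (a sketch): substituting the linear CEF into the moment condition gives
\[ {\mathbb{E}}[X_i (Y_i - X_i^{\prime} b)] = 0 \implies {\mathbb{E}}[X_i Y_i] = {\mathbb{E}}[X_i X_i^{\prime}] b \implies b = {\mathbb{E}}[X_i X_i^{\prime}]^{-1} {\mathbb{E}}[X_i Y_i] = \beta, \]
so the population regression coefficients coincide with the CEF coefficients.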
Regression Approximation Property
The function \(X_i' \beta\) provides the minimum-MSE linear approximation to \({\mathbb{E}}[Y_i | X_i]\), that is:
\[ \beta = {\arg\!\min}_b {\mathbb{E}}\left[ ({\mathbb{E}}[Y_i | X_i] - X_i' b)^2 \right]. \]
\[ \begin{align*} (Y_i - X_i' b)^2 &= \left( (Y_i - {\mathbb{E}}[Y_i | X_i]) + ({\mathbb{E}}[Y_i | X_i] - X_i' b) \right)^2 \\ &= (Y_i - {\mathbb{E}}[Y_i | X_i])^2 + ({\mathbb{E}}[Y_i | X_i] - X_i' b)^2 \\ &\quad + 2 (Y_i - {\mathbb{E}}[Y_i | X_i]) ({\mathbb{E}}[Y_i | X_i] - X_i' b). \end{align*} \]
Taking expectations, the cross term vanishes (\({\mathbb{E}}[Y_i | X_i] - X_i' b\) is a function of \(X_i\), and \(Y_i - {\mathbb{E}}[Y_i | X_i]\) is uncorrelated with any function of \(X_i\)), and the first term does not depend on \(b\); so minimizing \({\mathbb{E}}[(Y_i - X_i' b)^2]\) over \(b\) is equivalent to minimizing \({\mathbb{E}}[({\mathbb{E}}[Y_i | X_i] - X_i' b)^2]\).
Suppose \(\mathcal{T} = \{0, 1\}\)
Under SUTVA (no interference and consistency), the potential outcomes are \(Y_{i} (1)\) and \(Y_{i} (0)\).
The unit-level treatment effect is \(\tau_i = Y_{i} (1) - Y_{i} (0)\).
We observe \(X_i\), \(T_i\), and \(Y_i = T_i Y_{i} (1) + (1 - T_i) Y_{i} (0)\).
In this simple case the OLS estimator solves the least squares problem:
\[ (\widehat{\tau}, \widehat{\alpha}) = {\arg\!\min}_{\tau, \alpha} \sum_{i=1}^n \left(Y_i - \alpha - \tau T_i\right)^2 \]
The coefficient estimate \(\widehat{\tau}\) is algebraically equivalent to the difference in means (\(\widehat{\tau}_{DiM}\)):
\[ \widehat{\tau} = \bar{Y}_1 - \bar{Y}_0 = \widehat{\tau}_{DiM} \]
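A quick numerical check (a minimal sketch with simulated data; the variable names are mine):
# verify that the OLS coefficient on treatment equals the difference in means
set.seed(1)
n <- 200
T_i <- rbinom(n, 1, 0.5)                               # randomized binary treatment
Y_i <- 1 + 0.5 * T_i + rnorm(n)                        # outcome
ols_tau <- unname(coef(lm(Y_i ~ T_i))["T_i"])          # OLS coefficient on T_i
dim_tau <- mean(Y_i[T_i == 1]) - mean(Y_i[T_i == 0])   # difference in means
all.equal(ols_tau, dim_tau)                            # TRUE (up to numerical error)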
\[ \begin{align*} Y_i &= T_i Y_i(1) + (1 - T_i) Y_i(0) \\ &= Y_i(0) + T_i ( Y_i(1) - Y_i(0) ) \quad\text{($\because$ distribute)}\\ &= Y_i(0) + \tau_i T_i \quad \text{($\because$ unit treatment definition)}\\ &= {\mathbb{E}}[Y_i(0)] + \tau T_i + ( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (\tau_i - \tau) \quad (\because \pm {\mathbb{E}}[Y_i(0)] + \tau T_i)\\ &= {\mathbb{E}}[Y_i(0)] + \tau T_i + (1 - T_i)( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (Y_i(1) - {\mathbb{E}}[Y_i(1)]) \quad\text{($\because$ distribute)}\\ &= \alpha + \tau T_i + \eta_i \end{align*} \]
The linear functional form is fully justified by the SUTVA assumption alone:
\[ \eta_i = (1 - T_i)( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (Y_i(1) - {\mathbb{E}}[Y_i(1)]) \]
\[ \begin{align*} {\mathbb{E}}[\eta_i {\:\vert\:}T_i] &= {\mathbb{E}}[(1 - T_i)( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (Y_i(1) - {\mathbb{E}}[Y_i(1)]) {\:\vert\:}T_i] \\ &= (1 - T_i) ({\mathbb{E}}[Y_i(0) {\:\vert\:}T_i] - {\mathbb{E}}[Y_i(0)]) + T_i ({\mathbb{E}}[Y_i(1) {\:\vert\:}T_i] - {\mathbb{E}}[Y_i(1)]) \end{align*} \]
Under randomization \({\mathbb{E}}[Y_i(t) {\:\vert\:}T_i] = {\mathbb{E}}[Y_i(t)]\), so \({\mathbb{E}}[\eta_i {\:\vert\:}T_i] = 0\): randomization plus consistency justify the linear model.
Does not imply homoskedasticity or normal errors, though!
Practical implication: Use heteroskedasticity-robust (HC2) standard errors for inference, e.g. via lm_robust().
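A minimal sketch, assuming the estimatr package is installed (its lm_robust() uses HC2 standard errors by default); the data are simulated and the names are mine:
library(estimatr)
set.seed(2)
n <- 500
T_i <- rbinom(n, 1, 0.5)                        # randomized treatment
Y_i <- 1 + 0.5 * T_i + rnorm(n, sd = 1 + T_i)   # error variance differs by arm (heteroskedastic)
fit <- lm_robust(Y_i ~ T_i)                     # HC2 standard errors by default
summary(fit)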
Under strong ignorability we can use regression to estimate causal effects of interest.
What if instead we assume that selection depends on a set of observed covariates \(X_{i}\), i.e. there is selection on observables?
\[ \{ Y_i(0), Y_i(1) \} {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i. \]
\[ \eta_i {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i \implies {\mathbb{E}}[\eta_i {\:\vert\:}T_i=1, X_i] = {\mathbb{E}}[\eta_i {\:\vert\:}T_i=0, X_i] = {\mathbb{E}}[\eta_i {\:\vert\:}X_i]. \]
Suppose each unit's response function is linear in the treatment with a constant effect \(\tau\): \[ f_i(t) = \alpha + \tau t + \eta_i \]
Observed outcomes are given by:
\[ Y_i = f_i(T_i) = \alpha + \tau T_i + \eta_i, \]
where \(\eta_i\) captures all unit-specific determinants of \(f_i(T_i)\) other than \(T_i\).
We also allow \(\eta_i\) to depend on the covariates we identified:
\[ \eta_i = X_i^{\prime} \gamma + \nu_i, \]
where \(\gamma\) is the population regression solution.
Orthogonality of residuals to regressors in the population implies \({\mathbb{E}}[X_i \nu_i] = 0\); treating this CEF as linear (so that \({\mathbb{E}}[\nu_i {\:\vert\:}X_i] = 0\)), we have
\[ {\mathbb{E}}[\eta_i | X_i] = X_i^\prime \gamma \]
Since, holding \(X_i\) fixed, only \(\nu_i\) varies within \(\eta_i\), under this model we have
\[ f_i(t) {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i \implies \nu_i {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i, \]
\[ Y_i = \alpha + \tau T_i + X_i^{\prime} \gamma + \nu_i, \]
where \(\nu_i\) is uncorrelated with \(X_i\) and also with \(T_i\) conditional on \(X_i\).
\[ {\mathbb{E}}[f_i(t) {\:\vert\:}T_i = t, X_i] = {\mathbb{E}}[f_i(t) {\:\vert\:}X_i] = \alpha + \tau t + X_i^{\prime} \gamma. \]
\[ \begin{align*} {\mathbb{E}}[f_i(t) &- f_i(t - v) {\:\vert\:}X_i] \\ &= (\alpha + \tau t + X_i^{\prime} \gamma) - (\alpha + \tau (t - v) + X_i^{\prime} \gamma) \\ &= \tau v \end{align*} \]
Result:
\[ \begin{align*} {\mathrm{cov}}(Y_i, T_i) &= {\mathrm{cov}}(\alpha + \tau T_i + X_i' \gamma + \nu_i,\, T_i) \\ &= \tau {\mathrm{cov}}(T_i, T_i) + {\mathrm{cov}}(X_{1i} \gamma_1 + \ldots + X_{Ki} \gamma_K, T_i) \\ &= \tau {\mathbb{V}}(T_i) + \gamma_1 {\mathrm{cov}}(X_{1i}, T_i) + \ldots + \gamma_K {\mathrm{cov}}(X_{Ki}, T_i) \end{align*} \]
\[ \implies \frac{{\mathrm{cov}}(Y_i, T_i)}{{\mathbb{V}}(T_i)} = \tau + \underbrace{\gamma^{\prime} \delta}_{\text{OVB}} \]
where \(\delta\) are coefficients from regressions of \(X_1, \ldots, X_K\) on \(T_i\).
OVB = \(\gamma^\prime \delta\), where \(\gamma\) collects the effects of the omitted covariates on the outcome and \(\delta\) collects the coefficients from regressing each covariate on \(T_i\).
The same holds when we include some controls:
\[ \text{OVB} = \tilde{\gamma}' \tilde{\delta}. \]
Everything is just defined in terms of variables that have been residualized with respect to the included controls.
OVB = confounder impact \(\times\) imbalance (Cinelli and Hazlett 2020).
Let’s practice applying the OVB formula:
OVB = \((X_{ki}, Y_i)\) relationships \(\times\) \((X_{ki}, T_i)\) relationships
Effect of democratic institutions on growth, estimated via regression of growth on democratic institutions.
Effect of exposure to negative advertisements on turnout, estimated via regression of turnout on the number of ads seen.
set.seed(20250127) # set seed
n <- 1000 # sample size
tau <- 0.5 # ATE
gamma <- 0.3 # effect of confounder on outcome
delta <- 0.3 # effect of the confounder on the treatment
# confounder
confounder <- rnorm(n, mean = 50, sd = 10)
# democratic institutions (correlated with confounder)
democracy_score <- delta * confounder + rnorm(n, mean = 0, sd = 5)
# economic growth (influenced by both the confounder and democratic institutions)
growth <-
  tau * democracy_score +
  gamma * confounder +
  rnorm(n, mean = 0, sd = 5)
# true regression including the confounder
model_unbiased <- lm(growth ~ democracy_score + confounder)
cat("Unbiased model error:", unname(model_unbiased$coefficients[2]) - tau, "\n")
Unbiased model error: -0.01922573
# biased regression omitting the confounder
model_biased <- lm(growth ~ democracy_score)
cat("Biased model error:", unname(model_biased$coefficients[2]) - tau, "\n")
Biased model error: 0.3081032
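To connect the bias back to the OVB formula, a sketch continuing the simulation above: the bias of the short regression should be approximately \(\gamma\) times the coefficient from regressing the confounder on the treatment.
# check OVB = gamma * delta (confounder's effect on outcome times confounder-on-treatment coefficient)
gamma_hat <- coef(lm(growth ~ democracy_score + confounder))["confounder"]   # confounder -> outcome, long regression
delta_hat <- coef(lm(confounder ~ democracy_score))["democracy_score"]       # confounder regressed on treatment
cat("Implied OVB (gamma_hat * delta_hat):", unname(gamma_hat * delta_hat), "\n")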
"Omitted variables" is a misleading term because it could suggest that you should include any variable that is correlated with the treatment and the outcome.
But remember that bad controls exist, e.g. post-treatment variables and colliders.
The discussion of OVB suggests that we can use regression to adjust for variables (\(X_i\)) to estimate the treatment effect (\(\tau\)) in two ways.
Long regression: Include covariates \(X_i\) directly in the regression model.
Residualized regression: regress \(T_i\) on \(X_i\), obtain the residuals \(\tilde{T}_i\), and regress \(Y_i\) on \(\tilde{T}_i\).
Result: The coefficient on \(T_{i}\) in the long regression and the coefficient on \(\tilde{T}_i\) in the residualized regression are identical.
Nodes: \(T\), \(Y\), \(Z_1\), \(Z_2\), and \(Z_3\).
Paths: \(T \to Y\), \(T \leftarrow Z_3 \to Y\), \(T \leftarrow Z_1 \to Z_3 \leftarrow Z_2 \to Y\), etc.
\(Z_1\) is a parent of \(T\) and \(Z_3\).
\(T\) and \(Z_3\) are children of \(Z_1\).
\(Z_1\) is an ancestor of \(Y\).
\(Y\) is a descendant of \(Z_1\).
Definition: Blocked Paths
A set of nodes \(X\) blocks a path \(p\) if either:
\(p\) contains a chain \(A \rightarrow B \rightarrow C\) or a fork \(A \leftarrow B \rightarrow C\) whose middle node \(B\) is in \(X\), or
\(p\) contains a collider \(A \rightarrow B \leftarrow C\) such that neither \(B\) nor any descendant of \(B\) is in \(X\).
Definition: \(d\)-separation
If \(X\) blocks all paths from \(T\) to \(Y\), then \(X\) \(d\)-separates \(T\) and \(Y\).
If \(X\) \(d\)-separates \(T\) and \(Y\), then \(Y {\mbox{$\perp\!\!\!\perp$}}T {\:\vert\:}X\).
Theorem: The Back-Door Criterion
A set \(X\) is sufficient for adjustment to identify the causal effect of \(T\) on \(Y\) if:
no node in \(X\) is a descendant of \(T\), and
\(X\) blocks every path between \(T\) and \(Y\) that contains an arrow into \(T\).
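As an illustration, a sketch using the dagitty R package, encoding the example DAG from earlier (edge list inferred from the paths listed above) and asking for back-door adjustment sets:
library(dagitty)
g <- dagitty("dag {
  Z1 -> T ; Z1 -> Z3 ; Z2 -> Z3 ; Z2 -> Y ; Z3 -> T ; Z3 -> Y ; T -> Y
}")
adjustmentSets(g, exposure = "T", outcome = "Y")  # for this DAG the minimal sets should be {Z1, Z3} and {Z2, Z3}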
Follow Cinelli, Forney, and Pearl (2024), which provides a systematic framework for thinking about control variables.
Key insight: Not all variables that are correlated with treatment and outcome should be controlled for!
We will classify controls as: good controls, neutral controls (which may still affect precision), and bad controls.
In model (a) reduction in variation is good! \(\rightarrow\) higher precision
In model (b) reduction in variation is bad! \(\rightarrow\) lower precision
In model (c) reduction in variation is good again! \(\rightarrow\) higher precision
In models (a) and (b) controlling for \(Z\) unblocks back-door paths and induces a spurious relationship between \(X\) and \(Y\).
In models (c) and (d) controlling for \(Z\) will unblock the non-causal path \(X \rightarrow Z \leftarrow U \rightarrow Y\) (collider bias).
In models (a) and (b) controlling for \(Z\) blocks the causal path.
In model (c) controlling for \(Z\) blocks part of the causal path.
In model (d) controlling for \(Z\) will not block the causal path or induce any bias.
To see the intuition behind post-treatment bias consider the following example
Suppose \(X \in \{0, 1\}\) is randomly assigned, and then
\[ \begin{align*} Z &= X + \varepsilon_Z, \\ Y &= \beta X + \gamma Z + \varepsilon_Y, \end{align*} \]
where \(\varepsilon_Z\) and \(\varepsilon_Y\) are independent standard normal draws.
Substituting in \(Y\):
\[ Y = (\beta + \gamma)X + \gamma \varepsilon_Z + \varepsilon_Y \]
Effect of \(X\) on \(Y\) is \(\beta + \gamma\).
Controlling for \(Z\), we would estimate an effect of \(\beta\).
The bias, \(-\gamma\), is the portion of the effect that has been “stolen away” by conditioning on \(Z\).
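A quick simulation sketch of this example (parameter values and variable names are mine):
# post-treatment bias: controlling for Z removes the part of the effect that runs through Z
set.seed(3)
n <- 10000
beta <- 1; gamma <- 2
X <- rbinom(n, 1, 0.5)                  # randomized treatment
Z <- X + rnorm(n)                       # post-treatment variable
Y <- beta * X + gamma * Z + rnorm(n)    # outcome
coef(lm(Y ~ X))["X"]                    # approximately beta + gamma = 3 (total effect)
coef(lm(Y ~ X + Z))["X"]                # approximately beta = 1 (gamma is "stolen away")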
Be mindful of what controls you include in your analysis (even if it is an experiment).
Draw a DAG with the controls you plan to include and check whether they block back-door paths without blocking causal paths or opening collider paths.
Also be mindful of the sizes of the effects of potential confounders: if their effects on the main independent and dependent variables can be shown to be limited, the OVB is small!
These are strong assumptions!
What if they are false? Let’s see.
Suppose \(T_i \in \{0, 1\}\) and \(X_i\) is discrete, taking values \(x_1, \dots, x_L\).
Let the conditional independence assumption (CIA) hold: \((Y_{i} (0), Y_{i} (1)) {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i\).
In this case the effect of interest is still
\[ \begin{align*} \tau_{ATE} &= {\mathbb{E}}_{X} [{\mathbb{E}}[Y_i (1) {\:\vert\:}X_i] - {\mathbb{E}}[Y_i (0) {\:\vert\:}X_i]] \\ &= {\mathbb{E}}_{X} [{\mathbb{E}}[Y_i (1) {\:\vert\:}T_i = 1, X_i] - {\mathbb{E}}[Y_i (0) {\:\vert\:}T_i = 0, X_i]] \\ &= \sum_{X} \tau_x {\textrm{Pr}}(X_i = x), \end{align*} \]
where \(\tau_x \equiv {\mathbb{E}}[Y_i (1) {\:\vert\:}X_i = x] - {\mathbb{E}}[Y_i (0) {\:\vert\:}X_i = x]\)
We can use a saturated (or one-way fixed effects) OLS regression model
\[ Y_i = \alpha_0 + \tau T_i + \mathbb{1} [X_i = x_2] \alpha_{x_2} + \dots + \mathbb{1} [ X_i = x_L ] \alpha_{x_L} + \varepsilon_i, \]
where \(\mathbb{1} [\cdot]\) denotes the indicator of event \(\cdot\); \(x_2, \dots, x_L\) exhaust all possible \(X_i\) values, omitting one (the reference category, to avoid collinearity with the intercept).
Recall regression anatomy: \(\widehat{\tau} = \frac{{\mathrm{cov}}(\tilde{Y}_i, \tilde{T}_i)}{{\mathbb{V}}(\tilde{T}_i)}\), where \(\tilde{T}_i\) is the residual from a regression of \(T_i\) on the other regressors
Let’s see if it actually works
# simulate data
n <- 1000
X <- rnorm(n)
D <- 0.5 * X + rnorm(n) # do not use T as a variable name (it is shorthand for TRUE)!
Y <- 2 * D + 1 * X + rnorm(n)
# standard regression
standard <- coef(lm(Y ~ D + X))["D"]
# make Y tilde and D tilde
tilde_Y <- lm(Y ~ X)$residuals
tilde_D <- lm(D ~ X)$residuals
# regression anatomy
anatomy <- coef(lm(tilde_Y ~ tilde_D))["tilde_D"]
# simplified regression anatomy
anatomy_simp <- coef(lm(Y ~ tilde_D))["tilde_D"]
data.frame(
  Method = c("Standard", "Regression Anatomy",
             "Regression Anatomy (Simplified)"),
  Coefficient = c(standard, anatomy, anatomy_simp)
) |>
  knitr::kable(digits = 3)

| Method | Coefficient |
|---|---|
| Standard | 1.978 |
| Regression Anatomy | 1.978 |
| Regression Anatomy (Simplified) | 1.978 |
Recall that \(\widehat{\tau} = \frac{{\mathrm{cov}}(Y_i, \tilde{T}_i)}{{\mathbb{V}}(\tilde{T}_i)}\), where \(\tilde{T}_i\) is the residual from a regression of \(T_i\) on the other regressors
\[ \begin{align*} \widehat{\tau} &= \frac{{\mathrm{cov}}(Y_i, \tilde{T}_i)}{{\mathbb{V}}(\tilde{T}_i)}\\ &= \frac{{\mathbb{E}}[{\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i](T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])] - \textcolor{#d65d0e}{{\mathbb{E}}[{\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i]]\,{\mathbb{E}}[T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i]]}}{{\mathbb{E}}[(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2]}\\ &= \frac{{\mathbb{E}}[{\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i](T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])]}{{\mathbb{E}}[(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2]}. \quad \text{($\because$ independence of residuals)} \end{align*} \]
Now let’s look at the first term in the numerator
\[ \begin{align*} {\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i] &= T_i{\mathbb{E}}[Y_{i} (1) - Y_{i} (0) {\:\vert\:}T_i, X_i] + \textcolor{#458588}{{\mathbb{E}}[Y_{i} (0) {\:\vert\:}T_i, X_i]} \quad \text{($\because$ switching equation)}\\ &= T_i{\mathbb{E}}[Y_{i} (1) - Y_{i} (0) {\:\vert\:}T_i, X_i] + {\mathbb{E}}[\textcolor{#458588}{Y_{i} (0)} {\:\vert\:}T_i = 0, X_i] \quad \text{($\because$ CIA)}\\ &= T_i \textcolor{#458588}{{\mathbb{E}}[Y_{i} (1) - Y_{i} (0) {\:\vert\:}T_i, X_i]} + {\mathbb{E}}[Y_{i} {\:\vert\:}T_i = 0, X_i] \quad \text{($\because$ switching equation)}\\ &= T_i \tau_X + {\mathbb{E}}[Y_{i} {\:\vert\:}T_i = 0, X_i] \quad \text{($\because$ definition of $\tau_X$)} \end{align*} \]
\[ \begin{align*} \widehat{\tau} &= \frac{{\mathbb{E}}[{\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i](T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])]}{{\mathbb{E}}[(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2]} \\ &= \frac{{\mathbb{E}}[\textcolor{#458588}{\left( T_i \tau_X + {\mathbb{E}}[Y_{i} {\:\vert\:}T_i = 0, X_i] \right)} (T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])]}{{\mathbb{E}}[(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2]} \quad \text{($\because$ plug in result)}\\ &= \frac{{\mathbb{E}}[ \textcolor{#458588}{T_i \tau_X} (T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i]) + \textcolor{#458588}{{\mathbb{E}}[Y_{i} {\:\vert\:}T_i = 0, X_i]} (T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i]) ]}{{\mathbb{E}}[(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2]} \quad \text{($\because$ distribute)} \\ &= \frac{{\mathbb{E}}[ \textcolor{#d65d0e}{T_i} \tau_X \textcolor{#d65d0e}{(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])} ]}{{\mathbb{E}}[(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2]} \quad \text{($\because$ independence of residuals)} \\ &= \frac{{\mathbb{E}}_X [ \tau_X \textcolor{#d65d0e}{{\mathbb{E}}[ T_i^2 - T_i {\mathbb{E}}[T_i {\:\vert\:}X_i] {\:\vert\:}X_i]} ]}{{\mathbb{E}}_X [ \textcolor{#d65d0e}{{\mathbb{E}}[ (T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2 {\:\vert\:}X_i ]} ]} \quad \text{($\because$ iterated ${\mathbb{E}}$)} \\ &= \frac{{\mathbb{E}}_X [ \tau_X \textcolor{#d65d0e}{{\mathbb{V}}(T_i {\:\vert\:}X_i)} ]}{{\mathbb{E}}_X [ \textcolor{#d65d0e}{{\mathbb{V}}(T_i {\:\vert\:}X_i)} ] } \quad \text{($\because$ definition of ${\mathbb{V}}$)}\\ &= \frac{\sum_X \tau_X \textcolor{#d65d0e}{{\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x) (1 - {\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x))} {\textrm{Pr}}(X_i = x)}{\sum_X \textcolor{#d65d0e}{{\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x) (1 - {\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x))} {\textrm{Pr}}(X_i = x)}. \quad \text{($\because$ binary $T_i$)} \end{align*} \]
Compare
\[ \tau_{ATE} = \sum_{X} \tau_x {\textrm{Pr}}(X_i = x), \]
versus
\[ \widehat{\tau} = \frac{\sum_X \tau_X \textcolor{#d65d0e}{{\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x) (1 - {\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x))} {\textrm{Pr}}(X_i = x)} {\sum_X \textcolor{#d65d0e}{{\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x) (1 - {\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x))} {\textrm{Pr}}(X_i = x)} \]
\(\widehat{\tau}\) aggregates \(\tau_x\) using conditional-variance weights for \(T_i\) rather than weighting by \({\textrm{Pr}}(X_i = x)\) alone.
If \(\tau_x\) were constant across \(X_i\), this precision weighting could be good from an efficiency standpoint.
If \(T_i {\mbox{$\perp\!\!\!\perp$}}X_i\), then \(\widehat{\tau}\) reduces to weighting by the share of units with \(X_i = x\).
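For intuition, a stylized two-stratum example (the numbers are illustrative): suppose \({\textrm{Pr}}(X_i = a) = {\textrm{Pr}}(X_i = b) = 0.5\), \(\tau_a = 1\), \(\tau_b = 2\), \({\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = a) = 0.5\), and \({\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = b) = 0.9\). Then \(\tau_{ATE} = 0.5 \cdot 1 + 0.5 \cdot 2 = 1.5\), while the variance weighting gives
\[ \frac{0.25 \cdot 0.5 \cdot 1 + 0.09 \cdot 0.5 \cdot 2}{0.25 \cdot 0.5 + 0.09 \cdot 0.5} \approx 1.26, \]
because the stratum where treatment is closest to a coin flip gets the larger weight.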
Logic carries through to continuous treatments (Angrist and Pischke 2009, 77–80; Aronow and Samii 2016).
Aronow and Samii (2016) show that for arbitrary \(T_i\) and \(X_i\),
\[ \widehat{\tau} \xrightarrow{p} \frac{{\mathbb{E}}[w_i \tau_i]}{{\mathbb{E}}[w_i]}, \quad \text{where } w_i = (T_i - {\mathbb{E}}[T_i | X_i])^2, \]
in which case
\[ {\mathbb{E}}[w_i | X_i] = {\mathbb{V}}[T_i {\:\vert\:}X_i]. \]
The effective sample is weighted by \(\widehat{w}_i = (T_i - \widehat{{\mathbb{E}}}[T_i | X_i])^2\) (squared residual from regression of \(T_i\) on covariates).
Even with a representative sample, regression estimates may not aggregate effects in a representative manner. Regression estimates are local to an effective sample.
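A minimal self-contained sketch (simulated data; the names are mine) of computing \(\widehat{w}_i\) and checking who the effective sample is:
# effective-sample weights are the squared residuals from regressing treatment on covariates
set.seed(4)
n <- 2000
X_bin <- rbinom(n, 1, 0.5)                              # binary covariate
T_sim <- rbinom(n, 1, ifelse(X_bin == 1, 0.9, 0.5))     # treatment nearly deterministic when X_bin = 1
w_hat <- lm(T_sim ~ X_bin)$residuals^2                  # w_i hat = squared residualized treatment
c(nominal_share = mean(X_bin),                          # share of X_bin = 1 units in the sample
  effective_share = weighted.mean(X_bin, w_hat))        # their (much smaller) share in the effective sample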
set.seed(20250202) # set seed
n <- 1000 # sample size
tau_base <- 0.5
gamma <- 0.1 # effect of X on outcome
# some discrete covariate
X <- sample(x = 1:100, size = n, replace = TRUE)
# ATE: average of the heterogeneous conditional effects over X = 1, ..., 100
tau_total <- sum((tau_base + 0.01 * 1:100) / 100)
# treatment 1: randomly assigned, independent of X
democracy_high <- rbinom(n, size = 1, prob = .5)
# treatment 2: propensity increasing in X (probabilities capped below 1)
democracy_high_2 <-
  rbinom(n, size = 1, prob = sapply(X, function(x) min(.5 + 0.01 * x, .99)))
# economic growth (influenced by both X and democratic institutions, with effects heterogeneous in X)
growth <-
  (tau_base + 0.01 * X) * democracy_high +
  gamma * X +
  rnorm(n, mean = 0, sd = 5)
growth_2 <-
  (tau_base + 0.01 * X) * democracy_high_2 +
  gamma * X +
  rnorm(n, mean = 0, sd = 5)
# regression adjusting for X; treatment independent of X
bias1 <- lm(growth ~ democracy_high + factor(X))$coefficients[2] - tau_total
# regression adjusting for X; treatment propensity depends on X
bias2 <- lm(growth_2 ~ democracy_high_2 + factor(X))$coefficients[2] - tau_total

Consider a multiple regression model: \(Y_i = \alpha + \tau T_i + X_i^\prime \beta + \nu_i\).
To find \(\tau\), the coefficient on \(T_i\), the Frisch-Waugh-Lovell Theorem states that:
Regress \(Y_i\) on \(X_i\) and obtain the residuals \(\tilde{Y}_i = Y_i - X_i^\prime \widehat{\beta}\).
Regress \(T_i\) on \(X_i\) and obtain the residuals \(\tilde{T}_i = T_i - X_i^\prime \widehat{\delta}\).
Regress \(\tilde{Y}_i\) on \(\tilde{T}_i\) to obtain \(\widehat{\tau}\).
In addition, the residuals from this last regression are identical to the residuals from the full-model regression.
Intuition: the coefficient on \(T_i\) is identified only from the variation in \(T_i\) that remains after partialling out \(X_i\), i.e., from the part of the treatment not explained by the covariates.